Journal of Chemical Theory and Computation — Latest Matching Preprints

1

Collinearity of Decomposed Energy Terms in MM-GBSA Binding Free Energy Calculations

Sevim, A.; Kocak, A.

2026-06-29 biophysics 10.64898/2026.06.24.734195 medRxiv

Top 0.1%

38.4%

The molecular mechanics-generalized Born surface area (MMGBSA) method is one of the most commonly used end-state approaches for calculating binding free energies in computational drug design and screening studies. It is customary to break up the free energy into van der Waals, electrostatic, polar solvation (GB), and nonpolar solvation (SA) terms and then either correlate these terms with experiment or assign physical meaning to each term. Here, we demonstrate that the underlying assumption of independent fitting coefficients for decomposed energy terms could be invalid. Through analytic derivation and large-scale molecular dynamics simulations, we show that (i) the protein and ligand Coulomb interaction energy and the GB solvation correction are almost perfectly collinear (R² ≥ 0.99), reflecting their designed role as vacuum electrostatics plus solvent screening, and (ii) the van der Waals interaction and SA term likewise exhibit strong correlation, as both depend primarily on buried surface area. Interaction entropy and C2 entropy corrections are also found to be strongly dependent on underlying electrostatic fluctuations, further reinforcing redundancy. These findings hold both at the level of instantaneous trajectory fluctuations and when averaged across a diverse set of 139 protein-protein complexes, and they persist in both single-trajectory and three-trajectory MMGBSA protocols. Our results caution against using decomposed MMGBSA terms as independent predictors in regression models and suggest instead combining correlated terms into effective polar, nonpolar, and entropic contributions. Our study provides a systematic diagnosis of collinearity in MMGBSA and highlights pathways toward more interpretable and statistically robust predictive modeling.
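
The collinearity diagnosis described in this abstract can be illustrated with a minimal sketch. It is not the authors' code. The energies below are synthetic, the 0.95 screening factor and the noise level are arbitrary assumptions, and the helper names `r_squared` and `variance_inflation_factor` are hypothetical. The sketch computes the squared Pearson correlation between a Coulomb-like term and a GB-like term, the resulting variance inflation factor, and a combined polar term of the kind the abstract recommends.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def variance_inflation_factor(x, y):
    """VIF of one predictor given a single other predictor: 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared(x, y))


# Synthetic, illustrative energies in kcal/mol. These are NOT data from the preprint.
rng = random.Random(0)
coulomb = [rng.uniform(-500.0, -50.0) for _ in range(139)]
# Assumed toy screening model: the GB term opposes the Coulomb term, plus small noise.
gb = [-0.95 * e + rng.gauss(0.0, 5.0) for e in coulomb]

r2 = r_squared(coulomb, gb)
vif = variance_inflation_factor(coulomb, gb)

# Combining the two collinear terms gives one effective polar contribution.
polar = [e + g for e, g in zip(coulomb, gb)]

print(f"R^2 = {r2:.3f}, VIF = {vif:.1f}")
```

A large VIF means the two regression coefficients cannot be estimated independently, which is why a single summed predictor such as `polar` is the safer regression input in this toy setting.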

2

Solvation Shapes the Conformational Landscape of a Therapeutically Relevant SMN2 Splice-Site Defect

Khaled, M.; Leuschner, L.; Palomino-Hernandez, O.

2026-07-06 biophysics 10.64898/2026.07.01.735918 medRxiv

Top 0.1%

26.5%

The SMN2 exon 7 5' splice-site/U1 snRNA duplex contains an A$_{-1}$ bulge that weakens splice-site recognition and represents a therapeutically relevant RNA connectivity defect, yet its conformational landscape and coupling to solvation remain poorly understood. Here, we performed enhanced-sampling Hamiltonian replica-exchange molecular dynamics simulations of the SMN2 splice-site duplex using four explicit-solvent models (OPC, TIP4P-Ew, TIP3P, and SPC/E) and characterized the sampled ensemble using linear and machine-learned latent representations. Across representations, the A$_{-1}$ defect consistently populated three metastable conformational states distinguished by local duplex geometry, base stacking, hydrogen-bonding patterns, and solvent exposure. The relative populations of these states, together with first-shell hydration and Na$^+$ distributions around the defect, varied substantially across water models, demonstrating that hydration and ion organization actively shape the equilibrium between locally accommodated and solvent-exposed conformations of the SMN2 splice-site bulge. Our results shed light on the conformational components of this therapeutic RNA target and highlight the impact of solvation model as an important consideration for molecular simulations of RNA splice-site recognition and small-molecule repair.

3

Accurate ΔTm Prediction Without Protein Structure Inputs for Biomolecular Stability

Siegismund, D.; Wieser, M.; Natali, E.; Steigele, S.

2026-07-06 bioinformatics 10.64898/2026.07.02.735991 medRxiv

Top 0.1%

15.0%

Predicting protein stability, such as changes in melting temperature (ΔTm) caused by mutations, is a critical task in therapeutic protein engineering and drug discovery. This is reflected by a growing solution space that includes both sequence-based and structure-based AI methods. This paper demonstrates that accurate ΔTm prediction does not require structural input features; state-of-the-art results can be achieved with a careful training design for large sequence-based protein language models. We combine an autoresearch-inspired setup search with controlled ablation studies and show that a well-tuned sequence-only ESM2-650M model outperforms structure-informed methods in our benchmark, achieving the lowest error (MAE/RMSE) and competitive Pearson correlation without pH or structural inputs. We further show that choices such as loss function, pooling strategy, auxiliary supervision, and fine-tuning regime materially affect performance.

4

Structural Topology-based Electrostatic Model (STEM) Reveals Ion-Coordination Exchange as a Driver of RNA Folding Dynamics

Mainan, A.; Jaiswar, A.; Onuchic, J. N.; Sanbonmatsu, K. Y.; Roy, S.

2026-07-01 biophysics 10.64898/2026.06.27.734987 medRxiv

Top 0.1%

13.2%

RNA is a highly charged polyelectrolyte whose folding into functional architectures depends on an ionic atmosphere that screens strong electrostatic repulsion along the phosphate backbone. Whereas monovalent ions primarily stabilize secondary structure, divalent magnesium (Mg2+) drives tertiary folding, often through site-specific interactions that adopt various dynamic coordination modes. Current RNA structure-prediction frameworks rely largely on static direct-contact information, overlooking ion-mediated interactions and the dynamic exchange between distinct coordination modes, particularly the exchange between direct (inner-sphere) and solvent-separated (outer-sphere) Mg2+-phosphate coordination that often controls RNA conformational transitions. Here, we introduce the Structural Topology-based Electrostatic Model (STEM), a hybrid implicit-explicit framework that explicitly captures how the dynamic exchange between distinct ion-coordination modes dictates folding pathways. STEM combines explicit Mg2+ ions, which resolve site-specific interactions, with implicit K+ ions, which describe electrostatic screening mediated by counter-ion condensation through a generalized Manning counter-ion condensation model, enabling computationally efficient exploration of RNA folding landscapes. The model accurately reproduces crystallographic ion-binding sites, experimental preferential ion-interaction coefficients, and Small-Angle X-ray Scattering (SAXS)-derived radii of gyration across diverse RNA systems. Applied to a 58-nt rRNA fragment, STEM reveals that folding from an intermediate to the native state is driven by a chelated Mg2+-mediated tertiary contact and captures the resulting coordination-dependent conformational breathing.
By shifting the paradigm from static direct-contact descriptions to ion-mediated dynamic interactions, STEM provides a physically grounded framework for predicting dynamic ensembles of RNA structures, resolving their folding free-energy landscapes, and elucidating the mechanisms of RNA folding and function beyond native conformations across physiological salt conditions.

5

AI-guided discovery for low-resource peptide engineering using evolutionary scale modeling

Andrekson, L.; Rydbergh, R.; Mercado, R.; Wenzel, M.

2026-07-01 bioinformatics 10.64898/2026.06.25.734678 medRxiv

Top 0.2%

9.8%

Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering. Yet, it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross-validation R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20-500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. SCARSE significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while comparable performance was achieved on shorter peptide non-mutant datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies. Notably, we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning end-point performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.

6

A Comprehensive Evaluation of Protein Structure Prediction Models for Short Peptides

Ghosh, B.; Mukherjee, A.

2026-07-03 biophysics 10.64898/2026.07.02.736085 medRxiv

Top 0.2%

9.7%

Short peptides pose distinct challenges for computational structural biology due to their lack of stable tertiary structures, high conformational flexibility, and limited evolutionary signals. To address how modern deep-learning architectures navigate these challenges, we conducted a comprehensive benchmarking of five state-of-the-art protein structure prediction models: AlphaFold2, RoseTTAFold2, ESMFold, OmegaFold, and DMPfold2. Using a curated dataset of experimentally determined short peptide structures (10-49 amino acids) from the Protein Data Bank, we systematically evaluated predictive performance across varying sequence lengths and secondary structure classes. Our results demonstrate that prediction accuracy systematically improves with peptide length. Furthermore, all models perform significantly better on α-helical and mixed-structure peptides compared to β-sheet-rich and intrinsically disordered sequences. Among the evaluated methods, AlphaFold2 and the single-sequence language models, ESMFold and OmegaFold, proved to be the most consistent and accurate overall. We also observed that internal model confidence scores are imperfectly calibrated for short peptides, necessitating cautious interpretation. Finally, by extending our analysis to the dbAMP3 dataset of uncharacterized antimicrobial peptides, we demonstrate that a multi-model consensus approach provides a rational framework for identifying robust structural hypotheses in the absence of experimental reference structures.

7

Homology-aware cross-validation strategies for generalization assessment in RNA structure prediction

Bugnon, L.; Kulemeyer, G.; Gerard, M.; Di Persia, L.; Stegmayer, G.; Milone, D. H.

2026-06-29 bioinformatics 10.64898/2026.06.28.735057 medRxiv

Top 0.2%

9.5%

RNA secondary structure prediction is a fundamental challenge in bioinformatics, essential for understanding the functional roles of non-coding RNAs. Recently, deep learning models have transformed the field with impressive results, leading to critical discussions regarding the validity of current cross-validation strategies. On the one hand, traditional random partitioning yields overoptimistic results due to data leakage from uncontrolled homology. On the other hand, removing from the training set all sequences that exhibit even the slightest resemblance to the testing sequences penalizes learning-based methods by requiring generalization to completely out-of-distribution sequences. While it is very simple to remove sequences and retrain a machine-learned model, it is very difficult to remove the experimental data used for parameter tuning and the sequences used for the development of classical thermodynamic methods. Thus, these methods often benefit from an implicit knowledge leakage. In this work we critically review existing cross-validation strategies for RNA secondary structure prediction: random splitting, clustering-based splitting, and leaving one RNA family out for testing. We analyze the advantages and limitations of each strategy and outline future directions to ensure fair comparisons across the full range of sequence similarities, with the same rigor for both classical and learning-based methods.
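
Of the three strategies named in this abstract, leaving one RNA family out is the simplest to sketch. The example below is a hypothetical illustration, not the authors' implementation. The function name `leave_one_family_out`, the toy sequence identifiers, and the family labels are all assumptions, and family annotations are taken as given rather than computed from sequence similarity.

```python
from collections import defaultdict


def leave_one_family_out(records):
    """Yield (held_out_family, train_ids, test_ids), holding out each family once.

    `records` is a list of (sequence_id, family) pairs. Family labels are
    assumed to come from an external annotation.
    """
    by_family = defaultdict(list)
    for seq_id, family in records:
        by_family[family].append(seq_id)

    for held_out in sorted(by_family):
        test = list(by_family[held_out])
        # Every sequence from every other family goes into the training set.
        train = [
            seq_id
            for family, ids in by_family.items()
            if family != held_out
            for seq_id in ids
        ]
        yield held_out, train, test


# Hypothetical toy dataset, for illustration only.
records = [
    ("seq1", "tRNA"),
    ("seq2", "tRNA"),
    ("seq3", "5S_rRNA"),
    ("seq4", "riboswitch"),
    ("seq5", "riboswitch"),
]

folds = list(leave_one_family_out(records))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

Unlike random splitting, no family ever appears on both sides of a fold, so homologous sequences cannot leak from training into testing. The cost is that every test fold is fully out-of-family, which is the penalty on learning-based methods that the abstract discusses.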

8

ThermoFusion: A Multimodal Deep Learning Framework for Generalizable Prediction of Enzyme Thermostability

Wei, Y.; Eberini, I.; Meyer, F.

2026-07-07 bioinformatics 10.64898/2026.07.04.736494 medRxiv

Top 0.2%

8.7%

Protein thermostability is a critical property for both industrial and biomedical enzyme applications, yet experimental evaluation of mutation-induced stability changes remains laborious and costly. Here, we present ThermoFusion, a hybrid deep learning framework that integrates 3D protein structure embeddings from ThermoMPNN with sequence-based embeddings from the pretrained protein language model ESM2 to predict the effects of single-point mutations on protein stability (ΔΔG). ThermoFusion exhibits robust generalization, maintaining high predictive accuracy across out-of-distribution sequences with low identity to the training set, a scenario where many other machine learning models, including ThermoMPNN and state-of-the-art tools, perform poorly due to reliance on memorization. Benchmarking on a curated enzyme dataset comprising 105 enzymes and 3144 mutations shows that ThermoFusion reliably identifies stabilizing mutations while accurately predicting stability for enzymes beyond its training set. These results establish ThermoFusion as a powerful tool for rational enzyme design beyond its training set.

9

Solvent-buffer effects in molecular dynamics simulations of nucleic acids

Baghel, N.; Shrivastava, P.; Mehra, R.

2026-07-06 biophysics 10.64898/2026.07.05.736650 medRxiv

Top 0.3%

7.6%

Molecular dynamics simulations of nucleic acids are typically performed using a solvent-buffer distance of 10 Å between the solute surface and the simulation box boundary. Although this cell size has been extensively explored in protein simulations, its implications for nucleic acid dynamics are not well understood. Nucleic acids are elongated, highly charged, and flexible structures with hydration and dynamical properties distinct from those of proteins, and they may therefore require different solvent-layer considerations in simulations. In this study, we investigated the effect of simulation cell size on nucleic acid dynamics by simulating a 30-base-pair double-helical nucleic acid structure and its two single-stranded forms using solvent-buffer distances of 3, 5, 10, 15, and 20 Å. Smaller cells may impose restricted hydration, molecular crowding, and periodic image interactions, whereas larger cells provide solvent space for conformational relaxation. A total of 45 µs of molecular dynamics simulations were performed (3 structures × 5 cell sizes × 3 replicates × 1 µs). Our results show that while the commonly used 10 Å buffer may be sufficient to maintain the stability of the double-stranded nucleic acid, larger cells are required to capture the conformational dynamics of single-stranded structures. In both, increasing the cell size to 15 or 20 Å enables broader conformational sampling. The first hydration shell exhibits reduced crowding in the 20 Å cell, consistent with more relaxed conformations. At larger cell sizes, single-stranded nucleic acids adopt compact, self-associated conformations for stability. Together, this study presents physical insight into how simulation cell size and solvent environment influence nucleic acid dynamics.

10

Directional information flow as a tool for analyzing protein allostery

Yovanno, R. A.; Lau, A. Y.

2026-06-23 biophysics 10.64898/2026.06.22.733418 medRxiv

Top 0.3%

7.2%

The ability to tune protein function through the binding of modulatory ligands enables the development of therapeutics that steer a biological system away from dysfunctional states underlying disease. Understanding the dynamic mechanisms by which allosteric ligands alter protein function remains an important open question. Dynamical network models allow us to quantify information flow between protein functional sites. However, existing network models use time-symmetric metrics for computing information from correlated residue motions extracted from molecular dynamics (MD) simulations, failing to fully capture directional information flow between sites. Here, we developed a Python library, TEntroPy, and analysis workflow using transfer entropy to generate a directional protein network from equilibrium MD trajectories. Applying this workflow to proteins with known allosteric ligands, we identified residues in both allosteric and orthosteric (primary) binding sites acting as broadcasters and receivers of information. We then computed optimal paths of directional information flow between binding sites. The presence of temporal asymmetry in residue coupling identified from simulations of the unbound (apo) state suggests that directional information flow is encoded in the intrinsic dynamics of the protein. To test this, we perturbed key binding-site residues and demonstrated that our TE-weighted network captures perturbation-induced changes in dynamics along communication routes between binding sites. Identifying residue pairs with high temporal asymmetry provides an additional tool for understanding the dynamic mechanisms of allosteric communication.

11

De novo design of ligand binding and sensing with a physics based generative approach

Zhang, Y.; Ke, Y.; Zhi, R.; Jin, Q.; Feng, Y.; Wang, C.; Fang, M.; Liao, J.; Chen, D.; Liu, J.; Cao, L.

2026-07-14 bioinformatics 10.64898/2026.07.13.738243 medRxiv

Top 0.3%

7.2%

The de novo design of ligand-binding proteins has tremendous potential to revolutionize biosensor technology, yet converting these designs into functional sensors remains a major challenge due to the need for ligand-induced conformational changes or modulation of protein-protein interactions. Here, we introduce a physics-based generative approach for the de novo creation of proteins that bind small molecules and metal ions. Our method achieves customizable ligand-binding pocket formation in parallel with simulated protein folding, allowing for precise architectural control of the protein-ligand complex and facilitating the development of biosensors based on either ligand-triggered protein reassociation via split-protein reassembly or ligand-induced protein folding. We demonstrate the versatility of our computational method through successful designs targeting five small molecules, including the very small neurotransmitters serotonin and dopamine, and two metal ions. Biophysical characterization confirmed correct ligand binding, and crystal structures closely matched computational models. We demonstrated the biosensor engineering potential of these designs by constructing serotonin and dopamine sensors using a split protein strategy and explored several approaches to enhance sensor activity. Additionally, we developed a zinc sensor through a zinc-induced protein folding mechanism. Overall, our physics-based generative approach provides a robust framework for the de novo design of ligand-binding proteins, opening new avenues for the development of ligand-responsive biosensors.

12

PEPstrMOD2: Next-generation tertiary structure prediction of chemically modified and non-natural peptides

Jain, S.; Mehta, N. K.; Raina, S.; Kumar, P.; Varun; Raghava, G. P. S.

2026-07-06 bioinformatics 10.64898/2026.06.22.733733 medRxiv

Top 0.3%

6.5%

While most existing methods are limited to predicting the tertiary structures of proteins containing only canonical residues, the PEPstrMOD server (developed in 2015) pioneered structure prediction for chemically modified and non-natural peptides. Despite its widespread use, the original framework was restricted to peptides of 7 to 25 residues and relied on older backbone-prediction algorithms. To address these limitations, we present PEPstrMOD2, which introduces three major advancements over its predecessor. First, it replaces the original in-house coordinate generation with state-of-the-art deep learning (DL) algorithms, leveraging AlphaFold2 and ESMFold for highly accurate initial structure prediction. Second, it greatly expands the accessible chemical space by incorporating a new AMBER force-field-compatible library of 257 post-translational modifications (PTMs), 428 non-canonical amino acids (NCAAs), and 243 terminal modifications. Lastly, by exploiting the native scalability of AlphaFold2 (AF2) and ESMFold (EF), PEPstrMOD2 eliminates the original length restrictions, enabling the structural modeling of longer, complex therapeutic peptides and small proteins. We evaluated the performance of PEPstrMOD2 against state-of-the-art methods across three distinct peptide datasets. For the AfCyc dataset consisting of 80 cyclic peptides, PEPstrMOD2 obtained a competitive average atom-level Root Mean Square Deviation (RMSD) of 2.05 angstroms, compared to 1.13 angstroms by AlphaFold3 (AF3) and 1.82 angstroms by AfCycDesign. Remarkably, for the modified peptide ModPep433 dataset, PEPstrMOD2 outperformed AF3, achieving a lower average RMSD of 4.49 angstroms against 4.67 angstroms for AF3. Furthermore, on the ModPep16 benchmark, PEPstrMOD2 achieved an average RMSD of 2.50 angstroms, which is two times more accurate than the original PEPstrMOD (5.84 angstroms).
In summary, PEPstrMOD2 provides a powerful, high-throughput, and highly accurate platform to facilitate peptide-based drug development and structural biology research. While the original PEPstrMOD was restricted to a web server interface, PEPstrMOD2 is available as both an intuitive webserver and a standalone command-line tool via GitHub, featuring Docker support for easy deployment and reproducible, large-scale modeling pipelines (https://webs.iiitd.edu.in/raghava/pepstrmod/).

13

SAS_MoCa: a software for small-angle scattering data analysis of large unilamellar vesicles

Semeraro, E. F.; Pabst, G.

2026-07-02 biophysics 10.64898/2026.06.29.735169 medRxiv

Top 0.3%

6.3%

Small-angle X-ray or neutron scattering (SAXS/SANS) analysis of large unilamellar vesicles (LUVs) is often limited by high-dimensional bilayer models and the lack of dedicated, statistically rigorous workflows. Here, we introduce SAS_MoCa, an open-source Python package that integrates a compositional scattering density profile (SDP) description of lipid bilayers with a separated form factor (SFF) treatment of vesicle size and polydispersity, and couples these highly parameterized models to an adaptive thermodynamic simulated annealing algorithm formulated within a constrained Bayesian framework. SAS_MoCa enables users to incorporate quantitative prior information from, e.g., previous SAXS/SANS studies, dynamic light scattering, NMR, or molecular simulations, and returns full posterior parameter distributions, uncertainties (reported as medians and median absolute deviations) and correlations even from single SAXS curves. Validation on POPC, POPE and DMPC SAXS-only data demonstrates that the method yields reproducible structural parameters with uncertainties comparable to joint SAXS/contrast-variation SANS analyses. The modular architecture of SAS_MoCa facilitates extension to additional lipid systems and future joint SAXS/SANS or SANS-only applications.

14

Hydrophobic mismatch induces lipid sorting based on tail unsaturation

van Hilten, N.; Grabe, M.

2026-06-29 biophysics 10.64898/2026.06.23.734047 medRxiv

Top 0.4%

5.1%

Biological membranes contain a diverse set of membrane proteins surrounded by many different lipids, and the lateral organization and function of these molecules are closely intertwined. Here, we use coarse-grained molecular dynamics (MD) simulations to explore how hydrophobic mismatch between the length of transmembrane (TM) proteins and the thickness of the surrounding lipid membrane impacts the spatial distribution of the lipids. We constructed idealized cylindrically symmetric proteins, inspired by the Mattress Model developed in the 1980s, and simulated these model proteins in different lipid compositions. We found that unsaturated lipids were attracted to short TM proteins that thinned the membrane, while fully saturated lipids were attracted to long TM proteins that induced membrane extension. A simple mechanical description of the membrane deformation energy coupled to a lipid mixing model accurately predicted the enrichment/depletion, which was up to 33% in some cases. Our simulations also highlight that lipid sorting behavior is sensitive to protein tilt and protein surface roughness. By teasing out the fundamental physical principles in these simple models, our results provide a foundational understanding of how proteins and lipids form complex and transient assemblies, which we believe will be important for interpreting lipid-protein interactions for a host of membrane proteins that regulate cellular membranes and cell function.

15

Membrane Thickness Strain from Protein Inclusions: A Multiscale Simulation and X-Ray Scattering Study of Proteoliposomes

Semeraro, E. F.; Bartos, L.; Piller, P.; Deb, R.; Keller, S.; Vacha, R.; Pabst, G.

2026-07-08 biophysics 10.64898/2026.07.03.736288 medRxiv

Top 0.4%

4.9%

Integral membrane proteins remodel the surrounding lipid bilayer, but quantifying the resulting deformations and linking them to protein density in the membrane has remained challenging. Here, we introduce an integrative methodology that combines all-atom molecular dynamics (MD) simulations with multiscale small-angle X-ray scattering (SAXS) analysis to connect membrane strain to the protein/lipid ratio in proteoliposomes. Using outer membrane phospholipase A (OmpLA) reconstituted into lipid bilayers with both increased and decreased hydrophobic thickness, we systematically probe the effects of positive and negative hydrophobic mismatch. MD simulations demonstrate that OmpLA causes anisotropic, oscillatory thickness deformations extending up to eight times the radius of the first lipid shell surrounding the protein, yet the net change in average membrane thickness remains below 1%. Through our multiscale SAXS analysis, we quantitatively extract structural parameters, ranging from proteoliposome size to internal membrane architecture, using constrained Bayesian inference, with priors derived from MD findings. Specifically, we determine the protein/lipid molar ratio and average membrane strain, revealing excellent agreement between experiment and simulation. In thinner bilayers, substantial protein loss limits the analysis, highlighting the role of bilayer stability in sample preparation. Moreover, the predominance of OmpLA monomers in the thicker membranes is consistent with weak, membrane-mediated repulsive interactions between protein inclusions. Collectively, this integrative approach establishes a framework for quantifying protein-lipid interactions across molecular and mesoscale dimensions.

16

Protein hydration and druggability

Panasenko, S.; Khorev, V.; Petukhov, M.

2026-07-08 biophysics 10.64898/2026.07.06.736750 medRxiv

Top 0.4%

4.4%

A priori assessment of target proteins' druggability remains an unsolved problem in the field of drug development. The empirical approaches widely used to solve this problem demonstrate low efficiency. In this work, we investigated the hydration of a representative set of 65 evolutionarily and structurally unrelated human enzymes in an aqueous environment. Hydration depends only on the structure of the proteins, and not on the physical and chemical properties of any potential ligands. The results show that, unlike the widely used approaches based on calculations of the accessible surface area (ASA), the content of low-entropy water molecules (LEW) in the active sites of human enzymes is systematically higher than that in other areas of their surface, including inactive cavities. Optimal criteria and a step-by-step procedure for identifying protein ligand binding sites are proposed. The proposed approach, based on the calculation of the LEW content in the first hydration layer of potentially interesting target proteins, makes it possible to evaluate their druggability even before the development of any ligands. The article also presents a comparative analysis of experimental Raman spectroscopy data and molecular dynamics simulations of water hydrogen bonds using three widely used water models (TIP3P, OPC3, and TIP5P) and standard algorithms for calculating hydrogen bond networks.

17

MAERM: Predicting Enzyme-Reaction Matching Relationships with a Mixed-Attention Model

Liu, T.; Zhai, S.; Lin, S.; Zhan, X.; Deng, J.; Liu, H.; Siu, S. W. I.

2026-07-10 bioinformatics 10.64898/2026.07.06.736902 medRxiv

Top 0.4%

4.3%

Harnessing enzyme specificity requires a thorough understanding of enzyme promiscuity, which determines enzymes' catalytic scope; however, measuring this scope still relies heavily on labor-intensive analytical approaches. While data-driven approaches have emerged to predict the catalytic scope of enzymes, these methods continue to face challenges such as restricted datasets and insufficient integration of enzyme structural information and reaction transformations. Here, we introduce MAERM, an innovative mixed-attention model designed to predict enzyme-reaction matching relationships. Built on our MAERM-DB, a dataset with broad coverage of validated and chemoenzymatic catalysis data, MAERM utilizes a local-global attention module to integrate multimodal enzyme information with fine-grained reaction representations, thereby predicting enzyme-reaction matching probabilities. Results show that MAERM consistently outperforms all baselines, with an average F1-score of 0.984. Notably, on challenging test samples with less than 40% sequence identity to the training set, MAERM outperforms the second-ranked model by 5.9% in F1-score. In addition, MAERM achieves the highest top-10 success rate of 51.7% on Enzyme-405 and the highest balanced accuracy of 0.697 on BioCat-547, further supporting its generalizability in enzyme screening and chemoenzymatic catalysis. Finally, MAERM can serve as an efficient scoring module. When integrated with ProteinMPNN, MAERM has successfully guided novel enzyme design for two carbonyl reduction reactions, resulting in enhanced catalytic potential for the native substrate and demonstrating broad compatibility. Overall, MAERM has the potential to reduce the experimental cost of measuring enzymes' catalytic scope, facilitate enzyme design, and ultimately accelerate the design-build-test-learn cycle in enzyme engineering.

18

A general thermodynamic approach for model reduction of enzyme cycles and electrogenic transporters

Pan, M.; Gawthrop, P. J.; Cursons, J.; Crampin, E. J.

2026-07-08 systems biology 10.64898/2026.06.16.732208 medRxiv

Top 0.5%

4.1%

Mathematical models of enzyme cycles form the basis of quantifying key features of metabolism and membrane transport. These models are often integrated into more comprehensive models such as whole-cell models to understand emergent behaviours between interacting components. However, it is currently computationally infeasible to simulate the full dynamical behaviour of every enzyme at a network scale. Model reduction is frequently used to improve computational efficiency, but in general, these approaches do not preserve physical and thermodynamic consistency. Here, we outline a general method for simplifying enzyme kinetics models while retaining mass, charge and energy balance. We base our approach on the bond graph, which is a general methodology for modelling biological systems from fundamental physical laws. This approach ensures that key physical constraints are enforced in every model, regardless of their complexity. Our thermodynamic model reduction framework is readily extended to electrogenic transporters through the coupling of chemical and electrical processes. Through the application of our approach to both hypothetical enzyme cycles and real data from the Na+/K+ ATPase, we show that it can rapidly screen for plausible network structures in circumstances where enzyme catalytic mechanisms may not be fully characterised, facilitating biological discovery and drug development.

19

A model for PIP2/3 and Rnd1 effects on Plexin-B1 GAP activity on Rap1b GTPase derived from molecular dynamics simulations

Bhattarai, N.; Sahoo, A. R.; Buck, M.

2026-07-13 biophysics 10.64898/2026.07.09.737506 medRxiv

Top 0.5%

4.0%

Plexin-B1 is a transmembrane receptor that integrates signals from Rho-family and Ras-family (Rap1b) GTPases to regulate cellular processes. While ligand-stimulated activation of the receptor is largely understood, the role of membrane composition and GTPase allosteric effects on plexin structure, internal protein dynamics, and function is still to be elucidated. Here, we performed multi-replica, 1 μs all-atom simulations of Plexin-B1-GTPase complexes on PIP2- and PIP3-containing membranes to investigate the effects of these two signaling lipids on the receptor, as well as on the GTPases. We found that both Rap1b and Rnd1 stably associate with the membrane, with PIP2 promoting broader lipid engagement and stronger Rap1b-Plexin-B1 interactions, whereas PIP3 enhances Rnd1-Plexin contacts and induces a membrane-proximal orientation of Plexin's juxtamembrane helix and makes contacts with a previously discovered activation switch loop. Contact map and network analyses revealed lipid-dependent shifts in allosteric communication, with PIP2 favoring Rap1b-centric hotspots and PIP3 favoring Rnd1-centric pathways. These predictions allow us to suggest a model for plexin intracellular region activation where both the identity of phosphoinositides and GTPase context synergistically stabilize Plexin-B1 membrane engagement and alter structural dynamics and allosteric networks. Thus, we propose that the membrane is an active modulator of plexin receptor signaling.

20

BioMetAll v2.0: Introducing Scores, Metal Discrimination, and Side-Chain Descriptors for Predicting Metal-Binding Sites in Proteins.

Marechal, J. D.; Fernandez Diaz, R.; Pena Losada, R.; Sanchez Aparicio, J. E.; Gao, W.; Alemany, M.

2026-07-12 bioinformatics 10.64898/2026.07.09.737562 medRxiv

Top 0.5%

3.9%

Predicting the location of metal-binding sites in proteins is crucial for fundamental biological questions and biotechnological applications. Over the past decade, the rise in metal-bound protein structures in the Protein Data Bank, combined with advanced statistical models such as deep learning, has accelerated the development of metal-binding site prediction tools. Several approaches are now available, offering high-quality benchmarks and predictive performance. Our initial development in this area is BioMetAll, whose first version was based on backbone pre-organization. Here, we introduce its second version, featuring two major updates: 1) metal-specific scoring functions and 2) prediction using backbone geometry alone or in combination with first coordination sphere descriptors. Apart from demonstrating metal sensitivity and yielding better benchmarking results, this new version allows the assessment of how considering the metal's first coordination sphere, versus backbone pre-organization alone, influences how metallic species bind to proteins.